Succinct Suffix Arrays Based on Run-Length Encoding

Authors

Veli Mäkinen

Gonzalo Navarro

Abstract

A succinct full-text self-index is a data structure built on a text T = t1t2...tn that takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p1p2...pm in T, and is able to reproduce any text substring, so that the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. They usually take O(nH0) or O(nHk) bits, where Hk is the kth-order empirical entropy of T. The time to count how many times P occurs in T ranges from O(m) to O(m log n). We present a new self-index, called the run-length FM-index (RLFM index), that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The index requires nHk log2 σ + O(n) bits of space for small k. We then show how to implement the RLFM index in practice, and obtain in passing another implementation with different space-time tradeoffs. We empirically compare our implementations against the best existing implementations of other indexes and show that ours are the fastest among indexes taking less space than the text.
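As background for the counting query described in the abstract, the following is a minimal sketch of FM-index-style backward search over the Burrows-Wheeler transform (BWT) of T, together with a run-length encoding of the BWT. It is not the RLFM index from the paper: the function names (`bwt`, `run_length_encode`, `count_occurrences`) are hypothetical, the suffix sorting is naive, and `rank` scans the string in linear time instead of using the succinct constant-time structures a real self-index relies on. It only illustrates why counting takes one step per pattern character and why long runs in the BWT make run-length encoding attractive.

```python
from collections import Counter


def bwt(text):
    """BWT of text plus a '\0' sentinel, via naive suffix sorting (illustration only)."""
    s = text + "\0"  # assumption: '\0' does not occur in text and sorts first
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character is the one preceding each sorted suffix (wrapping around).
    return "".join(s[i - 1] for i in suffix_array)


def run_length_encode(b):
    """Collapse b into a list of (character, run length) pairs."""
    runs = []
    for ch in b:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]


def count_occurrences(b, pattern):
    """Count occurrences of pattern in the original text, given only its BWT b."""
    # C[c] = number of characters in b that are strictly smaller than c.
    counts = Counter(b)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def rank(c, i):
        # Occurrences of c in b[:i]; a linear scan here, O(1) in a real index.
        return b.count(c, 0, i)

    sp, ep = 0, len(b)  # half-open range of suffix-array rows
    for c in reversed(pattern):  # one backward-search step per pattern character
        if c not in C:
            return 0
        sp = C[c] + rank(c, sp)
        ep = C[c] + rank(c, ep)
        if sp >= ep:
            return 0
    return ep - sp


b = bwt("mississippi")
print(run_length_encode(b))
print(count_occurrences(b, "ssi"))
```

The loop performs m steps for a pattern of length m, so with constant-time rank the count takes O(m) time. The number of runs produced by `run_length_encode` is what a run-length-based index would aim to store instead of the full BWT.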


Similar resources

Counting Suffix Arrays and Strings

Suffix arrays are used in various application and research areas like data compression or computational biology. In this work, our goal is to characterize the combinatorial properties of suffix arrays and their enumeration. For fixed alphabet size and string length we count the number of strings sharing the same suffix array and the number of such suffix arrays. Our methods have applications to...


Time and Space Efficient Lempel-Ziv Factorization based on Run Length Encoding

We propose a new approach for calculating the Lempel-Ziv factorization of a string, based on run length encoding (RLE). We present a conceptually simple off-line algorithm based on a variant of suffix arrays, as well as an on-line algorithm based on a variant of directed acyclic word graphs (DAWGs). Both algorithms run in O(N + n log n) time and O(n) extra space, where N is the size of the stri...


Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT)

MOTIVATION: Over the last few years, methods based on suffix arrays using the Burrows-Wheeler Transform have been widely used for DNA sequence read matching and assembly. These provide very fast search algorithms, linear in the search pattern size, on a highly compressible representation of the dataset being searched. Meanwhile, algorithmic development for genotype data has concentrated on stati...


Succinct representations of lcp information and improvements in the compressed suffix arrays

We introduce two succinct data structures to solve various string problems. One is for storing the information of lcp, the longest common prefix, between suffixes in the suffix array, and the other is an improvement in the compressed suffix array which supports linear time counting queries for any pattern. The former occupies only 2n + o(n) bits for a text of length n for computing lcp between ...


Improving Text Indexes Using Compressed Permutations

Any sorting algorithm in the comparison model defines an encoding scheme for permutations. As adaptive sorting algorithms perform o(n lg n) comparisons on restricted classes of permutations, each defines one or more compression schemes for permutations. In the case of the compression schemes inspired by Adaptive Merge Sort, a small amount of additional data allows to support in good time the ac...



Journal title:

Nord. J. Comput.

Volume 12, Issue

Pages: -

Publication date: 2005
